This notebook is one in a series of many, where we explore how different data analysis strategies affect the outcome of a proteomics experiment based on isobaric labeling and mass spectrometry. Each analysis strategy or ‘workflow’ can be divided up into different components; it is recommend you read more about that in the introduction notebook.

In this notebook specifically, we investigate the effect of varying the Unit scale component on the outcome of the differential expression results. The three component variants are: log2 intensity, (untransformed) intensity, ratio.

The R packages and helper scripts necessary to run this notebook are listed in the next code chunk: click the ‘Code’ button. Each code section can be expanded in a similar fashion. You can also download the entire notebook source code.

library(stringi)
library(gridExtra)
library(dendextend)
library(kableExtra)
library(limma)
library(psych)
library(tidyverse)
source('other_functions.R')
source('plotting_functions.R')

Let’s load our PSM-level data set:

data.list <- readRDS(params$input_data_p)
dat.l <- data.list$dat.l # data in long format
display_dataframe_head(dat.l)
Mixture TechRepMixture Condition BioReplicate Run Channel Protein Peptide RT Charge PTM intensity TotalIntensity Ions.Score DeltaMZ isoInterOk noNAs onehit.protein shared.peptide
Mixture2 1 1 Mixture2_1 Mixture2_1 127C P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 44048.41 394657.6 65 0.00052 TRUE TRUE FALSE FALSE
Mixture2 1 0.5 Mixture2_0.5 Mixture2_1 127N P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 38298.89 394657.6 65 0.00052 TRUE TRUE FALSE FALSE
Mixture2 1 0.125 Mixture2_0.125 Mixture2_1 128C P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 39552.81 394657.6 65 0.00052 TRUE TRUE FALSE FALSE
Mixture2 1 0.667 Mixture2_0.667 Mixture2_1 128N P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 56730.70 394657.6 65 0.00052 TRUE TRUE FALSE FALSE
Mixture2 1 0.667 Mixture2_0.667 Mixture2_1 129C P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 48744.49 394657.6 65 0.00052 TRUE TRUE FALSE FALSE
Mixture2 1 1 Mixture2_1 Mixture2_1 129N P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 60179.96 394657.6 65 0.00052 TRUE TRUE FALSE FALSE

After the filtering done in data_prep.R, there are 19 UPS1 proteins remaining, even though 48 were originally spiked in.

# which proteins were spiked in?
spiked.proteins <- dat.l %>% distinct(Protein) %>% filter(stri_detect(Protein, fixed='ups')) %>% pull %>% as.character
remove_factors(spiked.proteins)
##  [1] "P02787ups"   "P02788ups"   "P01133ups"   "P69905ups"   "P01008ups"  
##  [6] "P02741ups"   "P01375ups"   "P10636-8ups" "O00762ups"   "P62988ups"  
## [11] "P10145ups"   "P05413ups"   "P00915ups"   "P02753ups"   "P02144ups"  
## [16] "P01344ups"   "P01579ups"   "P00709ups"   "P01127ups"
tmp=dat.l %>% distinct(Protein) %>% pull %>% as.character
# protein subsampling
if (params$subsample_p>0 & params$subsample_p==floor(params$subsample_p) & params$subsample_p<=length(tmp)){
  sub.prot <- tmp[sample(1:length(tmp), size=params$subsample_p)]
  if (length(spiked.proteins)>0) sub.prot <- c(sub.prot,spiked.proteins)
  dat.l <- dat.l %>% filter(Protein %in% sub.prot)
}

We store the metadata in sample.info and show some entries below. We also pick technical replicates with a dilution factor of 0.5 as the reference condition of interest. Each condition is represented by two of eight reporter Channels in each Run.

display_dataframe_head(sample.info)
Run Run.short Channel Sample Sample.short Condition Color
Mixture1_1 Mix1_1 127C Mixture1_1:127C Mix1_1:127C 0.125 black
Mixture1_1 Mix1_1 127N Mixture1_1:127N Mix1_1:127N 0.667 green
Mixture1_1 Mix1_1 128C Mixture1_1:128C Mix1_1:128C 1 red
Mixture1_1 Mix1_1 128N Mixture1_1:128N Mix1_1:128N 0.5 blue
Mixture1_1 Mix1_1 129C Mixture1_1:129C Mix1_1:129C 0.5 blue
Mixture1_1 Mix1_1 129N Mixture1_1:129N Mix1_1:129N 0.125 black
referenceCondition
## [1] "0.5"
channelNames
## [1] "127C" "127N" "128C" "128N" "129C" "129N" "130C" "130N"

1 Unit scale component

In the next three subsections, let’s look at our different ways to summarize quantification values from PSM to peptide (first step) to protein (second step).

dat.unit.l <- emptyList(variant.names)

1.1 log2 transformation of reporter ion intensities

Taking the log2 transform is a widely used approach. It renders the distribution of values more symmetrical, which is often presumed to be log-normal.

dat.unit.l$log2_intensity <- dat.l %>% mutate(response=log2(intensity)) %>% select(-intensity)

Alternatively, one can use untransformed intensities.

1.2 intensities on the original scale

dat.unit.l$intensity <- dat.l %>% rename(response=intensity)

1.3 intensity ratios

Using ratios is an approach that has been around for a long time, as it inherently mitigates some of the unwanted variance. There are many ways to define ratios, some of which are wiser than others. We take one of the best approaches that we assume may be commonly used by researchers, which is to divide each sample’s quantification values by the average of the corresponding feature values of the reference condition samples in that same run.

# use half-wide data to compute within-run average of PSM channels corresponding to the reference Condition
refCols <- sample.info %>% filter(Condition==referenceCondition) %>% distinct(Channel) %>% pull
denom.df=dat.l %>% filter(Condition==referenceCondition) %>% pivot_wider(id_cols=-one_of('Condition', 'BioReplicate'),names_from='Channel', values_from='intensity')
denom.df$denom=apply(denom.df[,refCols], 1, function(x) mean(x, na.rm=T))
denom.df=denom.df[,c('Run', 'Protein', 'Peptide', 'RT', 'Charge', 'PTM', 'denom')]
dat.unit.l$ratio <- dat.l %>% left_join(denom.df, by=c('Run', 'Protein', 'Peptide', 'RT', 'Charge', 'PTM')) %>% mutate(response=intensity/denom) %>% select(-c(intensity, denom)) 

2 Normalization component: medianSweeping (1)

Median sweeping means subtracting from each PSM quantification value the spectrum median (i.e., the row median computed across samples/channels). If the unit scale is set to intensities or ratios, the multiplicative variant of this procedure is applied: subtraction is replaced by division. Since median sweeping needs to be applied on matrix-like data, let’s switch to wide format. (Actually, this is semi-wide, since the Channel columns still have contributions form all Runs, but that’s OK because in the next step we split by Run.)

# switch to wide format
dat.unit.w <- lapply(dat.unit.l, function(x){
  pivot_wider(data = x, id_cols=-one_of(c('Condition', 'BioReplicate')), names_from=Channel, values_from=response)})

Now let’s sweep the medians of all the rows. No need to split this per Run, because each row in this semi-wide format contains only values from one Run and each median calculation is independent of the other rows.

# subtract the spectrum median log2intensity from the observed log2intensities
dat.norm.w <- dat.unit.w
dat.norm.w$log2_intensity[,channelNames] <- median_sweep(dat.norm.w$log2_intensity[,channelNames], 1, '-')
dat.norm.w$intensity[,channelNames] <- median_sweep(dat.norm.w$intensity[,channelNames], 1, '/')
dat.norm.w$ratio[,channelNames] <- median_sweep(dat.norm.w$ratio[,channelNames], 1, '/')

These (partially) normalized quantification values are now already comparable, but after summarization we will also sweep the columns on the protein level, as suggested by Herbrich at al..

3 Summarization component: Median summarization

We summarize quantification values from PSM to peptide (first step) to protein (second step) within each sample. Median summarization is simple: within each Run and within each Channel, we replace multiple related observations with their median. First, for each Peptide (median of the PSM values), then for each Protein (median of the peptide values).

# normalized data
dat.norm.summ.w <- lapply(dat.norm.w, function(x){
  y <- x %>% group_by(Run, Protein, Peptide) %>% summarise_at(.vars = channelNames, .funs = median, na.rm=T) %>% summarise_at(.vars = channelNames, .funs = median, na.rm=T) %>% ungroup()
  return(y)})

Let’s also summarize the non-normalized data for comparison later on.

# non-normalized data
# group by (run,)protein,peptide then summarize twice (once on each level)
# add select() statement because summarise_at is going bananas over character columns
dat.nonnorm.summ.w <- lapply(dat.unit.w, function(x) {
  y <- x %>% group_by(Run, Protein, Peptide) %>% summarise_at(.vars = channelNames, .funs = median, na.rm=T) %>% summarise_at(.vars = channelNames, .funs = median, na.rm=T) %>% ungroup()
  return(y)})

4 Normalization component: medianSweeping (2)

Now that the data is on the protein level, let’s sweep all values separately per protein in the columns/samples. This is slightly different from sweeping before the summarization step because the median of medians is not the same as the grand median, but this does not introduce any bias.

# medianSweeping: in each channel, subtract/divide median computed across all proteins within the channel
# do the above separately for each MS run
second_median_sweep <- function(dat, fun){
  dat.split <- split(dat, dat$Run) 
  dat.split.norm  <- lapply(dat.split, function(y) {
    y[,channelNames] <- median_sweep(y[,channelNames], 2, fun); return(y)})
  return(bind_rows(dat.split.norm))
}
dat.norm.summ.w$log2_intensity <- second_median_sweep(dat.norm.summ.w$log2_intensity, '-')
dat.norm.summ.w$intensity <- second_median_sweep(dat.norm.summ.w$intensity, '/')
dat.norm.summ.w$ratio <- second_median_sweep(dat.norm.summ.w$ratio, '/')

5 QC plots

Before getting to the DEA section, let’s do some basic quality control and take a sneak peek at the differences between the component variants we’ve chosen. First, however, we should make the data completely wide, so that each sample gets it’s own unique column.

# make data completely wide (also across runs)
## normalized data
dat.norm.summ.w2 <- lapply(dat.norm.summ.w, function(x) {
  return( x %>% pivot_wider(names_from = Run, values_from = all_of(channelNames), names_glue = "{Run}:{.value}") )
})
colnames(dat.norm.summ.w2[[1]])
##  [1] "Protein"         "Mixture1_1:127C" "Mixture1_2:127C" "Mixture2_1:127C"
##  [5] "Mixture2_2:127C" "Mixture1_1:127N" "Mixture1_2:127N" "Mixture2_1:127N"
##  [9] "Mixture2_2:127N" "Mixture1_1:128C" "Mixture1_2:128C" "Mixture2_1:128C"
## [13] "Mixture2_2:128C" "Mixture1_1:128N" "Mixture1_2:128N" "Mixture2_1:128N"
## [17] "Mixture2_2:128N" "Mixture1_1:129C" "Mixture1_2:129C" "Mixture2_1:129C"
## [21] "Mixture2_2:129C" "Mixture1_1:129N" "Mixture1_2:129N" "Mixture2_1:129N"
## [25] "Mixture2_2:129N" "Mixture1_1:130C" "Mixture1_2:130C" "Mixture2_1:130C"
## [29] "Mixture2_2:130C" "Mixture1_1:130N" "Mixture1_2:130N" "Mixture2_1:130N"
## [33] "Mixture2_2:130N"
## non-normalized data
dat.nonnorm.summ.w2 <- lapply(dat.nonnorm.summ.w, function(x) {
  return( x %>% pivot_wider(names_from = Run, values_from = all_of(channelNames), names_glue = "{Run}:{.value}") )
})

5.1 Boxplots

These boxplots that for all variants the distributions are very similar and symmetrical. In contrast, the Sum summarization produces very skewed distributions. The means of the distributions for multiplicative unit scales (intensity, ratio) are 1 instead of zero because there we do median sweeping using division instead of subtraction. Although for all summarization methods the boxplots are centered after normalization, this skewness of the Sum summarized values is ominous.

# use (half-)wide format
par(mfrow=c(1,2))
for (i in 1:n.comp.variants){
  boxplot_w(dat.nonnorm.summ.w[[variant.names[i]]],sample.info, paste('raw', variant.names[i], sep='_'))
  boxplot_w(dat.norm.summ.w[[variant.names[i]]], sample.info, paste('normalized', variant.names[i], sep='_'))}

5.2 MA plots

We then make MA plots of two single samples taken from condition 0.5 and condition 0.125, measured in different MS runs (samples Mixture2_1:127C and Mixture1_2:129N, respectively). Clearly, the normalization had a strong variance-reducing effect on the fold changes.

for (i in 1:n.comp.variants){
  p1 <- maplot_ils(dat.nonnorm.summ.w2[[variant.names[i]]], ma.onesample.num, ma.onesample.denom, scale=scale.vec[i], paste('raw', variant.names[i], sep='_'), spiked.proteins)
  p2 <- maplot_ils(dat.norm.summ.w2[[variant.names[i]]], ma.onesample.num, ma.onesample.denom, scale=scale.vec[i], paste('normalized', variant.names[i], sep='_'), spiked.proteins)
grid.arrange(p1, p2, ncol=2)}

To increase the robustness of these results, let’s make some more MA plots, but now for all samples from condition 0.5 and condition 0.125 (quantification values averaged within condition). Both the unnormalized and normalized data now show less variability, and extremely so in the case of unnormalized ratios. Using more samples (now 8 in both the enumerator and denominator instead of just one) in the fold change calculation makes the rolling average more robust. Also, it seems the spike-in proteins induce a small positive bias (blue curve is rolling average) for low abundance proteins after normalization.

channels.num <- sample.info %>% filter(Condition==ma.allsamples.num) %>% select(Sample) %>% pull %>% as.character
channels.denom <- sample.info %>% filter(Condition==ma.allsamples.denom) %>% select(Sample) %>% pull %>% as.character
for (i in 1:n.comp.variants){
  p1 <- maplot_ils(dat.nonnorm.summ.w2[[variant.names[i]]], channels.num, channels.denom, scale=scale.vec[i], paste('raw', variant.names[i], sep='_'), spiked.proteins)
  p2 <- maplot_ils(dat.norm.summ.w2[[variant.names[i]]], channels.num, channels.denom, scale=scale.vec[i], paste('normalized', variant.names[i], sep='_'), spiked.proteins)
grid.arrange(p1, p2, ncol=2)}

5.3 PCA plots

Now, let’s check if these multi-dimensional data contains some kind of grouping; It’s time to make PCA plots.

5.3.1 Using all proteins

Even though for all variants PC1 seems to capture the conditions, providing a gradient for the dilution number, only the 0.125 condition is completely separable in the normalized data.

par(mfrow=c(1, 2))
for (i in 1:n.comp.variants){
  if (variant.names[i]=='intensity') pca.scale=TRUE else pca.scale=FALSE
  pcaplot_ils(dat.nonnorm.summ.w2[[variant.names[i]]] %>% select(-'Protein'), info=sample.info, paste('raw', variant.names[i], sep='_'), scale=pca.scale)
  pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}

There are only 19 proteins supposed to be differentially expressed in this data set, which is only a very small amount in both relative (to the 4083 proteins total) and absolute (for a biological sample) terms.

5.3.2 Using spiked proteins only

Therefore, let’s see what the PCA plots look like if we were to only use the spiked proteins in the PCA. Now, the use of ratios almost makes the raw data separable. After normalization, all variants produce a similar plots where only conditions 0.5 and 0.667 aren’t clearly separable.

par(mfrow=c(1, 2))
for (i in 1:n.comp.variants){
  if (variant.names[i]=='intensity') pcaplot_ils(dat.nonnorm.summ.w2[[variant.names[i]]] %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, paste('raw', variant.names[i], sep='_'), scale=T) else pcaplot_ils(dat.nonnorm.summ.w2[[variant.names[i]]] %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, paste('raw', variant.names[i], sep='_'))
  pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}

Notice how for all PCA plots, the percentage of variance explained by PC1 is now much greater than when using data from all proteins. In a real situation without spiked proteins, you might plot data corresponding to the top X most differential proteins instead.

5.4 HC (hierarchical clustering) plots

The PCA plots of all proteins has a rather lower fraction of variance explained by PC1. We can confirm this using the hierarchical clustering dendrograms below: when considering the entire multidimensional space, the different conditions are not very separable at all. This is not surprising as there is little biological variation between the conditions: there are only 19 truly differential proteins, and they all (ought to) covary in exactly the same manner (i.e., their variation can be captured in one dimension).

par(mfrow=c(1,2))
for (i in 1:n.comp.variants){
  dendrogram_ils(dat.nonnorm.summ.w2[[variant.names[i]]] %>% select(-Protein), info=sample.info, paste('raw', variant.names[i], sep='_'))
  dendrogram_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-Protein), info=sample.info, paste('normalized', variant.names[i], sep='_'))}

5.5 Run effect p-value plot

Our last quality check involves a measure of how well each variant was able to assist in removing the run effect. Below are the distributions of p-values from a linear model for the response variable with Run as a covariate. If the run effect was removed successfully, these p-values ought to be large. Clearly, the raw data contains a run effect, which all variants are able to partially remove.

plots <- vector('list', n.comp.variants)
for (i in 1:n.comp.variants){
dat <- list(dat.nonnorm.summ.l[[variant.names[i]]], dat.norm.summ.l[[variant.names[i]]])
names(dat) <- c(paste('raw', variant.names[i], sep='_'), paste('normalized', variant.names[i], sep='_'))
plots[[i]] <- run_effect_plot(dat)}
grid.arrange(grobs = plots, nrow=n.comp.variants)

6 DEA component: Moderated t-test

We look at the log2 fold changes of each condition w.r.t. the reference condition with dilution ratio 0.125. Since we are working with a log2 unit scale already, this means that for each protein we just look at the difference in mean observation across all channels between one condition and the reference condition. Note that this is not the same as looking at the log2 of the ratio of mean raw intensities for each condition (left hand side below), nor the mean ratio of raw intensities for each condition (right hand side below), since \(log_2 (\frac{mean(B)}{mean(A)}) \neq \frac{mean(log_2 (B))}{mean(log_2 (A))} \neq mean(\frac{log_2 (B)}{log_2 (A)})\).

To check whether these fold changes are significant (criterium: \(q<0.05\)), we use a moderated t-test slightly adapted from the limma package, which in use cases like ours should improve statistical power over a regular t-test. In a nutshell, this is a t-test done independently for each protein, although the variance used in the calculation of the t-statistic is moderated using some empirical Bayes estimation. After testing, we make a correction for multiple testing using the Benjamini-Hochberg method in order to keep the FDR under control.

#{ INVESTIGATE late log2 transform
dat.norm.summ.w2$intensity_lateLog2 <- dat.norm.summ.w2$intensity
channelNames.prefixed <- colnames(dat.norm.summ.w2$intensity %>% select(-Protein))
dat.norm.summ.w2$intensity_lateLog2[,channelNames.prefixed] <- log2(dat.norm.summ.w2$intensity[,channelNames.prefixed])
variant.names <- names(dat.norm.summ.w2)
scale.vec <- c(scale.vec, 'log')
n.comp.variants <- n.comp.variants + 1
#}
# design matrix as used in ANOVA testing.
design.matrix <- get_design_matrix(referenceCondition, sample.info)
dat.dea <- emptyList(names(dat.norm.summ.w2))
for (i in 1:n.comp.variants) {
  # provide scale so moderated_ttest knows whether you input log2 or raw intensities.
  this_scale <- scale.vec[match(names(dat.dea)[i], variant.names)]
  d <- column_to_rownames(as.data.frame(dat.norm.summ.w2[[variant.names[i]]]), 'Protein')
  dat.dea[[variant.names[i]]] <- moderated_ttest(dat=d, design.matrix, scale=this_scale)
}

For each condition, we now get the fold changes, moderated and unmoderated p-values, moderated and unmoderated q-values (BH-adjusted p-values), and some other details (head of dataframe below).

display_dataframe_head(dat.dea[[1]])
logFC_0.125 logFC_0.667 logFC_1 t.ord_0.125 t.ord_0.667 t.ord_1 t.mod_0.125 t.mod_0.667 t.mod_1 p.ord_0.125 p.ord_0.667 p.ord_1 p.mod_0.125 p.mod_0.667 p.mod_1 q.ord_0.125 q.ord_0.667 q.ord_1 q.mod_0.125 q.mod_0.667 q.mod_1 df.r df.0 s2.0 s2 s2.post
A0AVT1 -0.0296332 -0.0190075 0.0086228 -1.2236300 -0.7848676 0.3560550 -1.1698892 -0.7503969 0.3404173 0.2312896 0.4391218 0.7244682 0.2512504 0.4588575 0.7359131 0.9569613 0.9960123 0.9898622 0.9350748 0.9805978 0.9840728 28 2.018963 0.0056242 0.0023459 0.0025664
A0FGR8 0.0121287 0.0389655 0.0247813 0.3627376 1.1653597 0.7411452 0.3596350 1.1553923 0.7348061 0.7195248 0.2537054 0.4647764 0.7216378 0.2570437 0.4681594 0.9841943 0.9960123 0.9820911 0.9795638 0.9723943 0.9650451 28 2.018963 0.0056242 0.0044720 0.0045495
A0MZ66 0.0035130 -0.0009938 -0.0183834 0.0761086 -0.0215311 -0.3982773 0.0769941 -0.0217816 -0.4029113 0.9398739 0.9829747 0.6934465 0.9391391 0.9827663 0.6898687 0.9978394 0.9975071 0.9898622 0.9964971 0.9975410 0.9825206 28 2.018963 0.0056242 0.0085220 0.0083271
A1L0T0 0.0520381 -0.0328378 -0.0500166 0.7951963 -0.5017969 -0.7643056 0.8165216 -0.5152539 -0.7848025 0.4358354 0.6212896 0.4536089 0.4229483 0.6115101 0.4409319 0.9569613 0.9975071 0.9820911 0.9350748 0.9924981 0.9650451 20 2.018963 0.0056242 0.0128474 0.0121851
A3KMH1 0.4342224 0.3206653 0.5484563 1.7510253 1.2931002 2.2116798 1.8344616 1.3547163 2.3170662 0.0952654 0.2107143 0.0387933 0.0801348 0.1892462 0.0301887 0.9569613 0.9960123 0.7960743 0.9350748 0.9540539 0.7680706 20 2.018963 0.0056242 0.1844849 0.1680848
A4D1E9 -0.0219254 -0.0731552 -0.0426655 -0.4771987 -1.5921945 -0.9285977 -0.4796669 -1.6004298 -0.9334006 0.6383938 0.1270236 0.3641701 0.6361944 0.1237544 0.3607366 0.9741019 0.9841938 0.9820911 0.9644451 0.9260254 0.9650451 20 2.018963 0.0056242 0.0063331 0.0062681

7 Results comparison

Now, the most important part: let’s find out how our component variants have affected the outcome of the DEA.

7.1 Confusion matrix

A confusion matrix shows how many true and false positives/negatives each variant has given rise to. Spiked proteins that are DE are true positives, background proteins that are not DE are true negatives. We calculate this matrix for all conditions and then calculate some other informative metrics based on the confusion matrices: accuracy, sensitivity, specificity, positive predictive value and negative predictive value.

Clearly, all variants perform relatively well across the board. That said, the contrast between conditions 0.667 and 0.5 seems not large enough to yield many significant results. Notice also how we sneaked the ad-hoc variant ‘intensity_lateLog2’ in there: it is the normalized intensities which are then log2-transformed before DEA. Apparently, those results lie quite close to those of the ‘regular’ log2 unit scale.

cm <- conf_mat(dat.dea, 'q.mod', 0.05, spiked.proteins)
print_conf_mat(cm, referenceCondition)
0.125 vs 0.5 contrast
log2_intensity
intensity
ratio
intensity_lateLog2
background spiked background spiked background spiked background spiked
not_DE 4061 4 4062 6 4062 6 4061 4
DE 3 15 2 13 2 13 3 15
0.125 vs 0.5 contrast
log2_intensity intensity ratio intensity_lateLog2
Accuracy 0.998 0.998 0.998 0.998
Sensitivity 0.789 0.684 0.684 0.789
Specificity 0.999 1.000 1.000 0.999
PPV 0.833 0.867 0.867 0.833
NPV 0.999 0.999 0.999 0.999
0.667 vs 0.5 contrast
log2_intensity
intensity
ratio
intensity_lateLog2
background spiked background spiked background spiked background spiked
not_DE 4064 18 4063 15 4063 15 4064 18
DE 0 1 1 4 1 4 0 1
0.667 vs 0.5 contrast
log2_intensity intensity ratio intensity_lateLog2
Accuracy 0.996 0.996 0.996 0.996
Sensitivity 0.053 0.211 0.211 0.053
Specificity 1.000 1.000 1.000 1.000
PPV 1.000 0.800 0.800 1.000
NPV 0.996 0.996 0.996 0.996
1 vs 0.5 contrast
log2_intensity
intensity
ratio
intensity_lateLog2
background spiked background spiked background spiked background spiked
not_DE 4060 5 4058 3 4058 3 4060 5
DE 4 14 6 16 6 16 4 14
1 vs 0.5 contrast
log2_intensity intensity ratio intensity_lateLog2
Accuracy 0.998 0.998 0.998 0.998
Sensitivity 0.737 0.842 0.842 0.737
Specificity 0.999 0.999 0.999 0.999
PPV 0.778 0.727 0.727 0.778
NPV 0.999 0.999 0.999 0.999

7.2 Correlation scatter plots

To see whether the three Summarization methods produce similar results on the detailed level of individual proteins, we make scatter plots and check the correlation between their fold changes and between their significance estimates (q-values, in our case).

# character vectors containing logFC and p-values columns
dea.cols <- colnames(dat.dea[[1]])
logFC.cols <- dea.cols[stri_detect_fixed(dea.cols, 'logFC')]
significance.cols <- dea.cols[stri_detect_fixed(dea.cols, 'q.mod')]
n.contrasts <- length(logFC.cols)

scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)

scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)

For all conditions, we see that all variants are well correlated (both their q-values and fold changes). The lateLog2 results are not identical but merely similar to those associated with the log2 intensity scale. Remarkably, though, the intensity and ratio results aren’t just similar but identical. That is because median sweeping and taking ratios (which are then used to calculate fold changes) are commutative with respect to each other, as long as either the denominator of the ratios is always the same.

7.3 Volcano plots

The volcano plot combines information on fold changes and statistical significance. The spike-in proteins are colored blue, and immediately it is clear that their fold changes dominate the region of statistical significance, which suggests the experiment and analysis were carried out successfully. The magenta, dashed line indicates the theoretical fold change of the spike-ins.

for (i in 1:n.contrasts){
  volcanoplot_ils(dat.dea, i, spiked.proteins, referenceCondition)}

7.4 Violin plots

A good way to assess the general trend of the fold change estimates on a more ‘macroscopic’ scale is to make a violin plot. Ideally, there will be some spike-in proteins that attain the expected fold change (red dashed line) that corresponds to their condition, while most (background) protein log2 fold changes are situated around zero.

Clearly, the empirical results tend towards the theoretical truth, but not a single observation attained the fold change it should have attained. There is clearly a strong bias towards zero fold change, which may partly be explained by the ratio compression phenomenon in mass spectrometry, although the effect seems quite extreme here.

All variants give rise to very similar distributions.

# plot theoretical value (horizontal lines) and violin per variant
violinplot_ils(lapply(dat.dea, function(x) x[spiked.proteins, logFC.cols]), referenceCondition)

8 Conclusions

For the given data set, the differences in proteomic outcomes between all unit scale variants (log2 intensity, intensity, ratio) are quite small. The QC plots suggest that they produce qualitative outcomes, although the fold changes seem to experience an unusually large amount of ratio compression (probably inherent to the data set rather than the methodology). Using normalized ratios is identical to using untransformed intensities, and if you want to work on log2 scale, it doens’t seem to matter whether you take the transform in the beginning or right before the DEA step.

9 Session information

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_BE.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_BE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_BE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.0     stringr_1.4.0     dplyr_1.0.2       purrr_0.3.4      
##  [5] readr_1.4.0       tidyr_1.1.2       tibble_3.0.4      ggplot2_3.3.2    
##  [9] tidyverse_1.3.0   psych_2.0.9       limma_3.45.19     kableExtra_1.3.1 
## [13] dendextend_1.14.0 gridExtra_2.3     stringi_1.5.3    
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-150         fs_1.5.0             lubridate_1.7.9     
##  [4] webshot_0.5.2        httr_1.4.2           tools_4.0.3         
##  [7] backports_1.1.10     R6_2.4.1             rpart_4.1-15        
## [10] DBI_1.1.0            mgcv_1.8-33          colorspace_1.4-1    
## [13] nnet_7.3-14          withr_2.3.0          tidyselect_1.1.0    
## [16] mnormt_2.0.2         compiler_4.0.3       cli_2.1.0           
## [19] rvest_0.3.6          xml2_1.3.2           labeling_0.4.2      
## [22] scales_1.1.1         digest_0.6.27        rmarkdown_2.5       
## [25] pkgconfig_2.0.3      htmltools_0.5.0      dbplyr_1.4.4        
## [28] highr_0.8            rlang_0.4.8          readxl_1.3.1        
## [31] rstudioapi_0.11      farver_2.0.3         generics_0.0.2      
## [34] jsonlite_1.7.1       ModelMetrics_1.2.2.2 magrittr_1.5        
## [37] Matrix_1.2-18        Rcpp_1.0.5           munsell_0.5.0       
## [40] fansi_0.4.1          viridis_0.5.1        lifecycle_0.2.0     
## [43] pROC_1.16.2          yaml_2.2.1           MASS_7.3-53         
## [46] plyr_1.8.6           recipes_0.1.14       grid_4.0.3          
## [49] blob_1.2.1           parallel_4.0.3       crayon_1.3.4        
## [52] lattice_0.20-41      haven_2.3.1          splines_4.0.3       
## [55] hms_0.5.3            tmvnsim_1.0-2        knitr_1.30          
## [58] pillar_1.4.6         stats4_4.0.3         reshape2_1.4.4      
## [61] codetools_0.2-16     reprex_0.3.0         glue_1.4.2          
## [64] evaluate_0.14        data.table_1.13.2    modelr_0.1.8        
## [67] vctrs_0.3.4          foreach_1.5.1        cellranger_1.1.0    
## [70] gtable_0.3.0         assertthat_0.2.1     xfun_0.18           
## [73] gower_0.2.2          prodlim_2019.11.13   broom_0.7.2         
## [76] e1071_1.7-4          class_7.3-17         survival_3.2-7      
## [79] viridisLite_0.3.0    timeDate_3043.102    iterators_1.0.13    
## [82] lava_1.6.8           ellipsis_0.3.1       caret_6.0-86        
## [85] ipred_0.9-9